{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 决策树"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 决策树算法是一种基本分类与回归算法。决策树模型呈树型结构，在分类问题中，表示基于特征对实例进行分类的过程。它可以认为是if-then规则的集合，也可以认为是定义在特征空间与类空间上的条件概率分布。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 决策树学习包括特征选择，决策树的生成和决策树的修剪。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 决策树的定义"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 分类决策树模型是一种对实例进行分类的树型结构。决策树由结点和有向边组成。结点有两种，内部结点和叶节点。内部结点表示一个特征或属性，叶节点表示一个类。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 决策树原理"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 信息熵&信息增益"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 熵指的是体系的混乱的程度，信息论中的熵是一种信息的度量方式，表示信息的混乱程度，也就是说：信息越有序，信息熵越低。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 信息增益：在划分数据集前后信息发生的变化称为信息增益。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 决策树工作原理"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 构建决策树"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def createBranch():\n",
    "    '''\n",
    "    检测数据集中的所有数据的分类标签是否相同：\n",
    "        if so return 类标签\n",
    "        else:\n",
    "            寻找划分数据集的最好特征\n",
    "            划分数据集创建分支节点\n",
    "                for 每个划分的子集\n",
    "                    调用函数createBranch并增加返回结果到分支节点中\n",
    "            return 分支节点\n",
    "    '''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 决策树算法特点"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 优点：计算复杂度不高，数据有缺失也可以跑，可以处理不相关特征。\n",
    "缺点：容易过拟合。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 决策树项目案例"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 判定鱼类和非鱼类"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 根据以下2个特征，将动物分成两类：鱼类和非鱼类。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 特征：\n",
    "  * 1.不浮出水面是否可以生存\n",
    "  * 2.是否有脚蹼"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 输入数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def createDataSet():\n",
    "    dataSet = [[1,1,'yes'],[1,1,'yes'],[1,0,'no'],[0,1,'no'],[0,1,'no']]\n",
    "    labels = ['no surfacing','flippers']\n",
    "    return dataSet,labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 为了计算熵，我们需要计算所有类别所有可能值包含的信息期望值，通过下面的公式得到："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$H=-\\sum_{i=1}^{n}p(x_i)log_2p(x_i)$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 计算给定数据集的香农熵的函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def calcShannonEnt(dataSet):\n",
    "    #求list的长度，表示计算参与训练的数据量\n",
    "    numEntries = len(dataSet)\n",
    "    #计算分类标签label出现的次数\n",
    "    labelCounts = {}\n",
    "    for featVec in dataSet:\n",
    "        #将当前实例的标签存储，即每一行数据的最后一个数据代表标签\n",
    "        "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
